A Statistical Study of the WPT-03 Corpus

Authors

  • Bruno Martins
  • Mário J. Silva

Abstract

This report presents a statistical study of WPT-03, a text corpus built from the pages of the “Portuguese Web” collected in the repository of the tumba! search engine. We give a statistical analysis of the textual contents available in the Portuguese Web, including size distributions, the language of the pages, and the terms they contain.

Similar resources

Statistical Translation Alignment with Compositionality Constraints

This article presents a method for aligning words between translations that imposes a compositionality constraint on alignments produced with statistical translation models. Experiments conducted within the WPT-03 shared task on word alignment demonstrate the effectiveness of the proposed approach.

Hedges in English for Academic Purposes: A Corpus-based study of Iranian EFL learners

Hedges, as tools to express tentativeness and doubt, have been studied in plenty of research papers in the Iranian EFL research setting. However, their use in a learner corpus, portraying Iranian learner English, is in need of more research attention. With this end in view, this study aimed at investigating how Iranian EFL learners who have majored in English-related fields in Iran deployed hed...

Competitive Grouping in Integrated Phrase Segmentation and Alignment Model

This article describes the competitive grouping algorithm at the core of our Integrated Segmentation and Alignment (ISA) model. ISA extracts phrase pairs from a bilingual corpus without requiring the precalculated word alignment as many other phrase alignment models do. Experiments conducted within the WPT-05 shared task on statistical machine translation demonstrate the simplicity and effectiv...

Fault diagnosis of gearboxes using LSSVM and WPT

This paper concentrates on a new procedure which experimentally recognises gear and bearing faults of a typical gearbox system using a least square support vector machine (LSSVM). Two wavelet selection criteria, Maximum Energy to Shannon Entropy ratio and Maximum Relative Wavelet Energy, are used and compared to select an appropriate wavelet for feature extraction. The fault diagnosis method co...

A Comparative Study of Effects of Input-Based, Meaning-Based Output, and Traditional Instructions on EFL Learners’ Grammar Learning

This quasi-experimental study examined the effects of input-based, meaning-based output (MO) and traditional instruction (TI) on EFL learners’ grammar learning. To this end, 120 junior high school students were selected from 4 intact classes. Each class was assigned to an instructional condition, that is, textual enhancement (TE), input flood (IF), MO, and TI. Before the treatment, a multiple-c...

Publication date: 2004